Overview

Dataset Statistics

Number of Variables 12
Number of Rows 50000
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 4.6 MB
Average Row Size in Memory 96.0 B
Variable Types
  • Numerical: 12
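
The memory figures above are consistent with twelve 8-byte numeric columns. The sketch below redoes that arithmetic. The 8-byte width (float64) and the reading of "MB" as 2**20 bytes are assumptions, not something the report states.

```python
# Assumed: every column is stored as an 8-byte float (float64).
BYTES_PER_VALUE = 8
N_COLUMNS = 12      # "Number of Variables"
N_ROWS = 50_000     # "Number of Rows"

row_bytes = N_COLUMNS * BYTES_PER_VALUE   # bytes per row
total_bytes = row_bytes * N_ROWS          # bytes for the whole table
total_mib = total_bytes / 2**20           # assumed: "MB" means 2**20 bytes

print(row_bytes)            # 96
print(round(total_mib, 1))  # 4.6
```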

Dataset Insights

jaro_distance is skewed
jaro_winkler_distance is skewed
overlap_coefficient_distance is skewed
soft_tfidf_distance is skewed
partial_ration_distance is skewed
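
The five variables flagged here are exactly the ones whose per-variable normality p-value, reported further down, is very small. bag_distance_distance also has a p-value (about 0.0039) but is not flagged. One reading consistent with those numbers is a p-value cutoff lying somewhere between them. The sketch below applies such a cutoff. The value 1e-5 and the name SKEW_P_THRESHOLD are assumptions for illustration, not settings taken from the report.

```python
# p-values copied from the per-variable sections of this report (rounded).
P_VALUES = {
    "jaro_distance": 1.8158e-10,
    "jaro_winkler_distance": 5.9233e-25,
    "overlap_coefficient_distance": 6.3964e-19,
    "soft_tfidf_distance": 6.2087e-12,
    "partial_ration_distance": 1.0780e-14,
    "bag_distance_distance": 3.9190e-03,
}

# Hypothetical cutoff; the report does not state the one it uses.
SKEW_P_THRESHOLD = 1e-5

def flagged_as_skewed(p_values, threshold=SKEW_P_THRESHOLD):
    """Return the names whose p-value falls below the threshold."""
    return [name for name, p in p_values.items() if p < threshold]

flagged = flagged_as_skewed(P_VALUES)
print(flagged)
```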

Variables


levenshtain_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.5172
Minimum 1.0044e-06
Maximum 0.9356
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • levenshtain_distance is skewed left (γ1 = -1.0957)

Quantile Statistics

Minimum 1.0044e-06
5-th Percentile 0.006042
Q1 0.4371
Median 0.5637
Q3 0.6572
95-th Percentile 0.7777
Maximum 0.9356
Range 0.9356
IQR 0.2201

Descriptive Statistics

Mean 0.5172
Standard Deviation 0.2068
Variance 0.04278
Sum 25860.6904
Skewness -1.0957
Kurtosis 0.7489
Coefficient of Variation 0.3999
  • levenshtain_distance has 4190 outliers
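
Several of the figures above are derived from others: Range is Maximum minus Minimum, IQR is Q3 minus Q1, Variance is the squared Standard Deviation, and Coefficient of Variation is Standard Deviation divided by Mean. The sketch below recomputes them from the rounded values printed for levenshtain_distance. The function name derived_stats is hypothetical, and small differences in the last digit are expected because the inputs are already rounded.

```python
def derived_stats(minimum, maximum, q1, q3, mean, std):
    """Recompute range, IQR, variance and CV from summary values."""
    return {
        "range": maximum - minimum,
        "iqr": q3 - q1,
        "variance": std ** 2,
        "cv": std / mean,  # coefficient of variation
    }

# Rounded values as printed for levenshtain_distance.
stats = derived_stats(
    minimum=1.0044e-06, maximum=0.9356,
    q1=0.4371, q3=0.6572,
    mean=0.5172, std=0.2068,
)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```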

needleman_wunsch_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.6795
Minimum 7.5916e-07
Maximum 1.3417
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • needleman_wunsch_distance is skewed left (γ1 = -0.743)

Quantile Statistics

Minimum 7.5916e-07
5-th Percentile 0.008933
Q1 0.555
Median 0.7162
Q3 0.8585
95-th Percentile 1.0945
Maximum 1.3417
Range 1.3417
IQR 0.3035

Descriptive Statistics

Mean 0.6795
Standard Deviation 0.2859
Variance 0.08175
Sum 33977.2995
Skewness -0.743
Kurtosis 0.414
Coefficient of Variation 0.4208
  • needleman_wunsch_distance has 4008 outliers
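
Skewness (γ1) and Kurtosis summarize the shape of the distribution: a negative γ1, as reported here, means a longer tail toward small values. The sketch below computes moment-based versions in plain Python. Whether the report uses exactly these population (divide-by-n) estimators is an assumption. The Kurtosis is taken to be excess kurtosis (normal = 0), which fits the negative values reported for some variables.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness (g1) and excess kurtosis (g2)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form, divide by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return skew, excess_kurt

# Toy data with a long tail toward small values, so skewness is negative.
sample = [0.05, 0.55, 0.60, 0.65, 0.70, 0.72, 0.75, 0.80]
skew, kurt = skewness_and_excess_kurtosis(sample)
print(f"skewness={skew:.4f}, excess kurtosis={kurt:.4f}")
```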

affine_gap_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.5746
Minimum 3.7484e-06
Maximum 1.0795
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • affine_gap_distance is skewed left (γ1 = -0.9315)

Quantile Statistics

Minimum 3.7484e-06
5-th Percentile 0.006999
Q1 0.4782
Median 0.6154
Q3 0.7272
95-th Percentile 0.8909
Maximum 1.0795
Range 1.0795
IQR 0.2489

Descriptive Statistics

Mean 0.5746
Standard Deviation 0.2345
Variance 0.05498
Sum 28729.3941
Skewness -0.9315
Kurtosis 0.5992
Coefficient of Variation 0.4081
  • affine_gap_distance has 4012 outliers

smith_waterman_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.5861
Minimum 5.0238e-06
Maximum 0.9126
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • smith_waterman_distance is skewed left (γ1 = -1.4711)

Quantile Statistics

Minimum 5.0238e-06
5-th Percentile 0.00592
Q1 0.5234
Median 0.6504
Q3 0.7277
95-th Percentile 0.8161
Maximum 0.9126
Range 0.9126
IQR 0.2043

Descriptive Statistics

Mean 0.5861
Standard Deviation 0.2177
Variance 0.04741
Sum 29306.9183
Skewness -1.4711
Kurtosis 1.5324
Coefficient of Variation 0.3715
  • smith_waterman_distance has 4509 outliers

jaro_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.9862
Minimum 0.9091
Maximum 0.9955
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • jaro_distance is skewed left (γ1 = -3.567)

Quantile Statistics

Minimum 0.9091
5-th Percentile 0.9729
Q1 0.9847
Median 0.9881
Q3 0.9904
95-th Percentile 0.9929
Maximum 0.9955
Range 0.08635
IQR 0.005706

Descriptive Statistics

Mean 0.9862
Standard Deviation 0.007583
Variance 5.7501e-05
Sum 49308.268
Skewness -3.567
Kurtosis 21.2177
Coefficient of Variation 0.007689
  • jaro_distance is not normally distributed (p-value 1.8158e-10)
  • jaro_distance has 4068 outliers
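
The report does not say which normality test produced this p-value. As a stand-in, the sketch below uses the Jarque-Bera test, which is built from the same skewness and excess kurtosis reported above. It is a different technique chosen for illustration, so it should not be expected to reproduce the report's p-value. The function name is hypothetical, and using all 50000 rows as n is an assumption.

```python
import math

def jarque_bera(n, skewness, excess_kurtosis):
    """Jarque-Bera statistic and its asymptotic p-value.

    Under normality the statistic is approximately chi-squared with
    2 degrees of freedom, whose survival function is exp(-x / 2).
    """
    statistic = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

# Values printed for jaro_distance, with n assumed to be the row count.
stat, p = jarque_bera(n=50_000, skewness=-3.567, excess_kurtosis=21.2177)
print(f"JB={stat:.1f}, p={p:.3g}")
```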

jaro_winkler_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.1088
Minimum 3.6999e-09
Maximum 0.5162
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • jaro_winkler_distance is skewed right (γ1 = 0.9239)

Quantile Statistics

Minimum 3.6999e-09
5-th Percentile 0.00037654
Q1 0.001922
Median 0.003801
Q3 0.286
95-th Percentile 0.3965
Maximum 0.5162
Range 0.5162
IQR 0.284

Descriptive Statistics

Mean 0.1088
Standard Deviation 0.1593
Variance 0.02539
Sum 5441.9428
Skewness 0.9239
Kurtosis -0.965
Coefficient of Variation 1.464
  • jaro_winkler_distance is not normally distributed (p-value 5.9233e-25)
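
Unlike the variables above, jaro_winkler_distance has no outlier line. That matches what a 1.5 times IQR (Tukey fence) rule would give for its quartiles, since both fences fall outside the observed Minimum and Maximum. The report does not state which outlier rule it applies, so the rule, the factor 1.5, and the function names below are assumptions.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(values, q1, q3, k=1.5):
    """Count values that fall strictly outside the fences."""
    low, high = tukey_fences(q1, q3, k)
    return sum(1 for x in values if x < low or x > high)

# Quartiles and extremes printed for jaro_winkler_distance.
low, high = tukey_fences(q1=0.001922, q3=0.286)
minimum, maximum = 3.6999e-09, 0.5162

# True when no value in [minimum, maximum] can lie outside the fences.
no_outliers_possible = low <= minimum and maximum <= high
print(no_outliers_possible)
```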

overlap_coefficient_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.2631
Minimum 3.3556e-07
Maximum 0.8996
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • overlap_coefficient_distance is skewed right (γ1 = 0.2248)

Quantile Statistics

Minimum 3.3556e-07
5-th Percentile 0.001515
Q1 0.007682
Median 0.216
Q3 0.4363
95-th Percentile 0.6131
Maximum 0.8996
Range 0.8996
IQR 0.4286

Descriptive Statistics

Mean 0.2631
Standard Deviation 0.2114
Variance 0.0447
Sum 13157.061
Skewness 0.2248
Kurtosis -1.0951
Coefficient of Variation 0.8034
  • overlap_coefficient_distance is not normally distributed (p-value 6.3964e-19)

generalized_jaccard_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.5581
Minimum 1.5676e-06
Maximum 0.952
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • generalized_jaccard_distance is skewed left (γ1 = -1.2544)

Quantile Statistics

Minimum 1.5676e-06
5-th Percentile 0.005935
Q1 0.4957
Median 0.6168
Q3 0.7139
95-th Percentile 0.8029
Maximum 0.952
Range 0.952
IQR 0.2181

Descriptive Statistics

Mean 0.5581
Standard Deviation 0.2204
Variance 0.04856
Sum 27902.5554
Skewness -1.2544
Kurtosis 0.8462
Coefficient of Variation 0.3949
  • generalized_jaccard_distance has 4732 outliers

tfidf_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.647
Minimum 6.8601e-06
Maximum 0.9754
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • tfidf_distance is skewed left (γ1 = -1.5651)

Quantile Statistics

Minimum 6.8601e-06
5-th Percentile 0.006282
Q1 0.5865
Median 0.7234
Q3 0.808
95-th Percentile 0.8792
Maximum 0.9754
Range 0.9754
IQR 0.2215

Descriptive Statistics

Mean 0.647
Standard Deviation 0.2365
Variance 0.05593
Sum 32348.7238
Skewness -1.5651
Kurtosis 1.7537
Coefficient of Variation 0.3655
  • tfidf_distance has 4288 outliers

soft_tfidf_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.989
Minimum 0.9091
Maximum 0.9983
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • soft_tfidf_distance is skewed left (γ1 = -3.6589)

Quantile Statistics

Minimum 0.9091
5-th Percentile 0.9754
Q1 0.9876
Median 0.9912
Q3 0.9933
95-th Percentile 0.9956
Maximum 0.9983
Range 0.08923
IQR 0.005621

Descriptive Statistics

Mean 0.989
Standard Deviation 0.00774
Variance 5.9908e-05
Sum 49450.9287
Skewness -3.6589
Kurtosis 22.0902
Coefficient of Variation 0.007826
  • soft_tfidf_distance is not normally distributed (p-value 6.2087e-12)
  • soft_tfidf_distance has 4154 outliers

partial_ration_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.2896
Minimum 1.5598e-06
Maximum 0.7384
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • partial_ration_distance is skewed left (γ1 = -0.3904)

Quantile Statistics

Minimum 1.5598e-06
5-th Percentile 0.002418
Q1 0.1402
Median 0.3299
Q3 0.415
95-th Percentile 0.526
Maximum 0.7384
Range 0.7384
IQR 0.2748

Descriptive Statistics

Mean 0.2896
Standard Deviation 0.1702
Variance 0.02897
Sum 14482.0308
Skewness -0.3904
Kurtosis -0.9462
Coefficient of Variation 0.5876
  • partial_ration_distance is not normally distributed (p-value 1.0780e-14)

bag_distance_distance

numerical

Approximate Distinct Count 50000
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 800000
Mean 0.4033
Minimum 3.0794e-06
Maximum 0.8956
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • bag_distance_distance is skewed left (γ1 = -0.2193)

Quantile Statistics

Minimum 3.0794e-06
5-th Percentile 0.005868
Q1 0.2867
Median 0.4108
Q3 0.5339
95-th Percentile 0.7193
Maximum 0.8956
Range 0.8956
IQR 0.2472

Descriptive Statistics

Mean 0.4033
Standard Deviation 0.1941
Variance 0.03766
Sum 20164.0243
Skewness -0.2193
Kurtosis -0.2641
Coefficient of Variation 0.4812
  • bag_distance_distance is not normally distributed (p-value 0.003919)
